A mathematical programming approach to SVM-based classification with label noise
The authors of this research acknowledge financial support by the Spanish Ministerio de Ciencia y Tecnología, Agencia Estatal de Investigación and Fondos Europeos de Desarrollo Regional (FEDER) via project PID2020-114594GB-C21. The authors also acknowledge partial support from projects FEDER-US-1256951, Junta de Andalucía P18-FR-1422, CEI-3-FQM331, and NetmeetData: Ayudas Fundación BBVA a equipos de investigación científica 2019. The first author was also supported by projects P18-FR-2369 (Junta de Andalucía) and IMAG-Maria de Maeztu grant CEX2020-001105-M /AEI /10.13039/501100011033 (Spanish Ministerio de Ciencia y Tecnología).
In this paper we propose novel methodologies to optimally construct Support Vector Machine-based classifiers that take into account that label noise may occur in the training sample. We propose different alternatives, based on solving Mixed Integer Linear and Non Linear models, that incorporate decisions on relabeling some of the observations in the training dataset. The first method incorporates relabeling directly in the SVM model, while a second family of methods combines clustering and classification, giving rise to a model that applies similarity measures and SVM simultaneously. Extensive computational experiments on a battery of standard datasets from the UCI Machine Learning repository show the effectiveness of the proposed approaches.
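As a rough illustration of the relabeling idea (not the paper's mixed integer formulation, which optimizes the hyperplane and the relabeling jointly), the sketch below fixes a hyperplane and enumerates small relabel sets, trading hinge loss against a per-flip cost. The function names and the cost parameter `c_flip` are our own assumptions, not the paper's notation.

```python
import itertools
import numpy as np

def hinge_objective(w, b, X, y, flip, c_flip=1.0):
    """Hinge loss of hyperplane (w, b) when the labels indexed by `flip`
    are switched, plus a fixed cost per relabelled observation."""
    y_eff = y.astype(float)
    y_eff[list(flip)] *= -1.0          # relabel the selected observations
    margins = y_eff * (X @ w + b)
    return float(np.maximum(0.0, 1.0 - margins).sum()) + c_flip * len(flip)

def best_relabelling(w, b, X, y, budget=1):
    """Enumerate every relabel set of size <= budget; return (objective, set)."""
    best = (hinge_objective(w, b, X, y, ()), ())
    for k in range(1, budget + 1):
        for flip in itertools.combinations(range(len(y)), k):
            obj = hinge_objective(w, b, X, y, flip)
            if obj < best[0]:
                best = (obj, flip)
    return best

# Toy data: the last point looks mislabelled relative to the hyperplane x = 0.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, 1, -1])
obj, flip = best_relabelling(np.array([1.0]), 0.0, X, y, budget=1)
# Flipping observation 3 pays cost 1 but removes a hinge loss of 3.
```

The paper's methods replace this brute-force enumeration with binary relabeling variables inside a single optimization model.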
Multiclass optimal classification trees with SVM-splits
In this paper we present a novel mathematical optimization-based methodology to construct
tree-shaped classification rules for multiclass instances. Our approach consists of
building Classification Trees in which, except for the leaf nodes, the labels are temporarily
left out and grouped into two classes by means of a SVM separating hyperplane. We provide
a Mixed Integer Non Linear Programming formulation for the problem and report the
results of an extended battery of computational experiments to assess the performance of
our proposal with respect to other benchmark classification methods.
Universidad de Sevilla/CBU; Spanish Ministerio de Ciencia y Tecnología, Agencia Estatal de Investigación, and Fondos Europeos de Desarrollo Regional (FEDER) via project PID2020-114594GB-C21; Junta de Andalucía projects FEDER-US-1256951, P18-FR-1422, CEI-3-FQM331, B-FQM-322-UGR20, AT 21_00032; Fundación BBVA through project NetmeetData: Big Data 2019; UE-NextGenerationEU (ayudas de movilidad para la recualificación del profesorado universitario); IMAG-Maria de Maeztu grant CEX2020-001105-M /AEI /10.13039/501100011033
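The tree-shaped rule described above can be sketched minimally: each internal node holds a separating hyperplane that routes observations left or right until a leaf label is reached. The hyperplanes here are hand-fixed for illustration, whereas the paper obtains them by solving a MINLP; the `Node` and `predict` names are ours.

```python
import numpy as np

class Node:
    """Internal nodes hold a hyperplane (w, b); leaves hold a class label."""
    def __init__(self, label=None, w=None, b=0.0, left=None, right=None):
        self.label, self.w, self.b = label, w, b
        self.left, self.right = left, right

def predict(node, x):
    """Descend the tree: points with w.x + b >= 0 go right, the rest left."""
    while node.label is None:
        node = node.right if node.w @ x + node.b >= 0 else node.left
    return node.label

# Hand-fixed example: the root hyperplane separates superclass {A} from
# {B, C}; the right child's hyperplane then splits B from C.
root = Node(w=np.array([1.0, 0.0]), b=0.0,
            left=Node(label="A"),
            right=Node(w=np.array([0.0, 1.0]), b=0.0,
                       left=Node(label="C"), right=Node(label="B")))
```

This mirrors the idea that, at every internal node, the (multiclass) labels are temporarily grouped into two superclasses separated by one SVM-style hyperplane.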
Juegos de evasión
This work has two goals. The first concerns the mathematical background of pursuit and evasion, models known as Chases and Escapes games. Five problems are solved using classic techniques from Mathematical Analysis, and a sixth one, of a different nature, is solved from a probabilistic point of view. The second goal, of an outreach character and carried out in collaboration with an Architecture student, consists of designing a real park where the aforementioned problems can be put into practice. The park, intended to be located in Seville, arises as an idea to bring the world of mathematics closer to the people of the city through a pleasant experience, showing that mathematics, combined with logical reasoning, is a powerful tool.
Universidad de Sevilla. Grado en Matemáticas
On the multisource hyperplanes location problem to fitting set of points
In this paper we study the problem of locating a given number of hyperplanes
minimizing an objective function of the closest distances from a set of points.
We propose a general framework for the problem in which norm-based distances
between points and hyperplanes are aggregated by means of ordered median
functions. A compact Mixed Integer Linear (or Non Linear) programming
formulation is presented for the problem and also an extended set partitioning
formulation with an exponential number of variables is derived. We develop a
column generation procedure embedded within a branch-and-price algorithm for
solving the problem by adequately performing its preprocessing, pricing and
branching. We also analyze geometrically the optimal solutions of the problem,
deriving properties which are exploited to generate initial solutions for the
proposed algorithms. Finally, the results of an extensive computational
experience are reported. The issue of scalability is also addressed showing
theoretical upper bounds on the errors assumed by replacing the original
datasets by aggregated versions.Comment: 30 pages, 5 Tables, 3 Figure
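A minimal numpy sketch of the objective being aggregated: norm-based (here Euclidean) point-to-hyperplane distances, each point matched to its closest hyperplane, combined through an ordered median function. Function names are illustrative, not the paper's.

```python
import numpy as np

def closest_distances(points, hyperplanes):
    """Euclidean distance from each point to its nearest hyperplane, where a
    hyperplane (w, b) is {x : w.x + b = 0} and the point-to-hyperplane
    distance is |w.x + b| / ||w||_2."""
    d = np.stack([np.abs(points @ w + b) / np.linalg.norm(w)
                  for (w, b) in hyperplanes])
    return d.min(axis=0)

def ordered_median(values, lam):
    """Ordered median aggregation: sort the values, then take a
    lambda-weighted sum. lam = (1,...,1) recovers the sum of distances;
    lam = (0,...,0,1) recovers the maximum (center) criterion."""
    return float(np.sort(values) @ np.asarray(lam, dtype=float))

# Two candidate hyperplanes (the coordinate axes) and three points.
pts = np.array([[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]])
hps = [(np.array([1.0, 0.0]), 0.0), (np.array([0.0, 1.0]), 0.0)]
d = closest_distances(pts, hps)   # [0., 0., 2.]
```

The optimization problem of the paper chooses the hyperplanes themselves so that `ordered_median(closest_distances(...))` is minimized; here they are fixed only to show what is being evaluated.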
A Mathematical Programming Approach to Optimal Classification Forests
In this paper, we introduce Optimal Classification Forests, a new family of
classifiers that takes advantage of an optimal ensemble of decision trees to
derive accurate and interpretable classifiers. We propose a novel mathematical
optimization-based methodology in which a given number of trees are
simultaneously constructed, each of them providing a predicted class for the
observations in the feature space. The classification rule is derived by
assigning to each observation its most frequently predicted class among the
trees in the forest. We provide a mixed integer linear programming formulation
for the problem. We report the results of our computational experiments, from
which we conclude that our proposed method has equal or superior performance
compared with state-of-the-art tree-based classification methods. More
importantly, it achieves high prediction accuracy with, for example, orders of
magnitude fewer trees than random forests. We also present three real-world
case studies showing that our methodology has very interesting implications in
terms of interpretability.
Comment: 24 pages, 9 figures, 1 table
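The majority-vote classification rule stated above can be sketched in a few lines; the trees below are arbitrary callables standing in for the optimally constructed trees of the paper.

```python
from collections import Counter

def forest_predict(trees, x):
    """Assign to an observation its most frequently predicted class among
    the trees in the forest."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hand-made "trees" (any callables returning a class label would do).
trees = [
    lambda x: "pos" if x[0] > 0 else "neg",
    lambda x: "pos" if x[1] > 0 else "neg",
    lambda x: "neg",
]
```

Note that `Counter.most_common` breaks ties by first-encountered order, a simplification of this sketch; the paper encodes the voting rule inside the mixed integer linear programming formulation.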
Robust optimal classification trees under noisy labels
This research has been partially supported by Spanish Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación/FEDER grant number PID2020-114594GB-C21, Junta de Andalucía projects P18-FR-1422, P18-FR-2369 and projects FEDER-US-1256951, B-FQM-322-UGR20, CEI-3-FQM331 and NetmeetData-Ayudas Fundación BBVA a equipos de investigación científica 2019. The first author was also partially supported by the IMAG-Maria de Maeztu grant CEX2020-001105-M /AEI /10.13039/501100011033.
In this paper we propose a novel methodology to construct Optimal Classification
Trees that takes into account that noisy labels may occur in the training sample. The
motivation of this new methodology is based on the superadditive effect of combining
margin-based classifiers and outlier detection techniques. Our approach rests
on two main elements: (1) the splitting rules for the classification trees are designed
to maximize the separation margin between classes applying the paradigm of SVM;
and (2) some of the labels of the training sample are allowed to be changed during the
construction of the tree trying to detect the label noise. Both features are considered
and integrated together to design the resulting Optimal Classification Tree. We present
a Mixed Integer Non Linear Programming formulation for the problem, suitable to
be solved using any of the available off-the-shelf solvers. The model is analyzed and
tested on a battery of standard datasets taken from UCI Machine Learning repository,
showing the effectiveness of our approach. Our computational results show that in
most cases the new methodology outperforms both in accuracy and AUC the results
of the benchmarks provided by OCT and OCT-H.
New Advances In Data Science Problems Through Hyperplanes Location
This doctoral dissertation focuses on developing new approaches to different Data Science problems from a Location Theory perspective. In particular, we concentrate on locating hyperplanes by solving Mixed Integer Linear and Non Linear Problems.
Chapter 1 introduces the baseline techniques involved in this work, which encompass Support Vector Machines, Decision Trees and Fitting Hyperplanes Theory.
In Chapter 2 we study the problem of locating a set of hyperplanes for multiclass classification problems, extending the binary Support Vector Machines paradigm. We present four Mathematical Programming formulations which allow us to vary the error measures involved in the problems as well as the norms used to measure distances. We report an extensive battery of computational experiments over real and synthetic datasets which reveal the power of our approach. Moreover, we prove that the kernel trick is applicable in our method.
Chapter 3 also focuses on locating a set of hyperplanes, in this case aiming to minimize an objective function of the closest distances from a set of points. The problem is treated in a general framework in which norm-based distances between points and hyperplanes are aggregated by means of ordered median functions. We present a compact formulation and also a set partitioning one. A column generation procedure is developed to solve the set partitioning problem. We report the results of extensive computational experiments, as well as theoretical results on scalability issues and a geometrical analysis of the optimal solutions.
Chapter 4 addresses the problem of finding a separating hyperplane for binary classification problems in which label noise may occur in the training sample. We derive three methodologies, two of them based on clustering techniques, which incorporate the ability to relabel observations, i.e., to treat them as if they belonged to the opposite class, during the training process. We report computational experiments showing that our methodologies obtain higher accuracies when training samples contain label noise.
Chapters 5 and 6 consider the problem of locating a set of hyperplanes, following the Support Vector Machines classification principles, in the context of Classification Trees. The methodologies developed in both chapters inherit properties from Chapter 4, which play an important role in the problem formulations. On the one hand, Chapter 5 focuses on binary classification problems where label noise can occur in training samples. On the other hand, Chapter 6 focuses on solving the multiclass classification problem. Both chapters present the results of our computational experiments, which show how the derived methodologies outperform other Classification Tree methodologies. Finally, Chapter 7 presents the conclusions of this thesis.